A Florida health insurance company wants to predict annual claims for individual clients. The company pulls a random sample of 50 customers. The owner wishes to charge an actuarially fair premium to ensure a normal rate of return. The owner collects the sampled customers' health care expenses from the last year and compares them with what is known about each customer's plan.
The data on the 50 customers in the sample is as follows:
Answer the following questions using complete sentences and attach all output, plots, etc. within this report.
For this assignment, ignore the categorical variables (gender, smoker, cities).
Perform univariate analyses on the quantitative variables (center, shape, spread). Include descriptive statistics and histograms. Be sure to use terms discussed in class such as bimodal, skewed left, etc.
str(Insurance)
## tibble [50 × 9] (S3: tbl_df/tbl/data.frame)
## $ Charges : num [1:50] 9145 7441 12143 3260 19023 ...
## $ Age : num [1:50] 52 45 60 31 39 25 25 57 34 42 ...
## $ BMI : num [1:50] 36.7 30.2 25.7 20.4 18.3 ...
## $ Female : num [1:50] 0 0 0 0 1 1 1 1 0 0 ...
## $ Children : num [1:50] 0 1 0 0 5 1 0 2 1 2 ...
## $ Smoker : num [1:50] 0 0 0 0 1 0 1 0 1 0 ...
## $ WinterSprings: num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
## $ WinterPark : num [1:50] 0 0 1 0 0 1 0 0 0 1 ...
## $ Oviedo : num [1:50] 1 1 0 1 1 0 1 1 0 0 ...
Insurance$Female <- NULL
Insurance$WinterPark <- NULL
Insurance$WinterSprings <- NULL
Insurance$Oviedo <- NULL
Insurance$Smoker <- NULL
library(dplyr)      # provides the %>% pipe
library(gtsummary)  # provides tbl_summary()
Insurance %>%
  tbl_summary(
    statistic = list(
      all_continuous() ~ c("{mean} ({sd})",
                           "{median} ({p25}, {p75})",
                           "{min}, {max}"),
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    type = all_continuous() ~ "continuous2"
  )
| Characteristic | N = 50¹ |
|---|---|
| Charges | |
| Mean (SD) | 12,142 (11,317) |
| Median (IQR) | 8,333 (4,360, 13,720) |
| Range | 2,494, 55,135 |
| Age | |
| Mean (SD) | 42 (13) |
| Median (IQR) | 40 (30, 53) |
| Range | 23, 64 |
| BMI | |
| Mean (SD) | 28.7 (5.6) |
| Median (IQR) | 28.0 (25.2, 32.2) |
| Range | 16.8, 42.1 |
| Children | |
| 0 | 17 / 50 (34%) |
| 1 | 14 / 50 (28%) |
| 2 | 12 / 50 (24%) |
| 3 | 6 / 50 (12%) |
| 5 | 1 / 50 (2.0%) |
| ¹ n / N (%) | |
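As a rough numeric check on shape, the gap between mean and median can be scaled by the standard deviation. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / SD, applied to the rounded figures from the table above. The data frame `stats` and the column `pearson_skew` are names made up for this illustration, and the coefficient is only a hint: the histograms are still needed to judge modality and tails.

```r
# Summary statistics copied from the table above. Children is omitted
# because tbl_summary() tabulated it as a categorical variable.
stats <- data.frame(
  variable = c("Charges", "Age", "BMI"),
  mean     = c(12142, 42, 28.7),
  median   = c(8333, 40, 28.0),
  sd       = c(11317, 13, 5.6)
)

# Pearson's second skewness coefficient: 3 * (mean - median) / sd.
# Positive values suggest a longer right tail.
stats$pearson_skew <- 3 * (stats$mean - stats$median) / stats$sd

print(stats)
```

All three coefficients come out positive, and Charges is by far the largest (about 1.0), which is consistent with a right-skewed distribution of charges. Age and BMI are only mildly positive.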
library(plotly)  # provides plot_ly() and layout()
plot_ly(x = Insurance$Age, type = "histogram", alpha = 0.6) %>%
  layout(title = 'Distribution of Age',
         xaxis = list(title = 'Age of the primary beneficiary'),
         yaxis = list(title = 'Count'))
plot_ly(x = Insurance$Children, type = "histogram", alpha = 0.6) %>%
  layout(title = 'Distribution of Children',
         xaxis = list(title = 'Number of children covered by health insurance plan (includes other dependents as well)'),
         yaxis = list(title = 'Count'))
plot_ly(x = Insurance$BMI, type = "histogram", alpha = 0.6) %>%
  layout(title = 'Distribution of BMI',
         xaxis = list(title = 'Primary beneficiary’s body mass index (kg/m²)'),
         yaxis = list(title = 'Count'))
library(ggplot2)  # provides ggplot(), geom_histogram(), labs(), theme_bw()
library(ggpubr)   # provides ggparagraph() and ggarrange()
# Refer to columns by bare name inside aes(); aes(Insurance$Age) triggers
# the "Use of `Insurance$Age` is discouraged" warning.
ggp1 <- ggplot(Insurance, aes(Age)) +
  geom_histogram(binwidth = 2, col = 'black', fill = 'darkblue', alpha = 0.75) +
  labs(title = "Distribution of Primary Beneficiary Age", x = 'Age') + theme_bw()
ggp2 <- ggplot(Insurance, aes(BMI)) +
  geom_histogram(binwidth = 2, col = 'black', fill = 'blue', alpha = 0.75) +
  labs(title = "Distribution of Primary Beneficiary BMI", x = 'BMI') + theme_bw()
# Children is a small integer count, so use one bin per value.
ggp3 <- ggplot(Insurance, aes(Children)) +
  geom_histogram(binwidth = 1, col = 'black', fill = 'lightblue', alpha = 0.50) +
  labs(title = "Number of children covered by health insurance plan", x = 'Children') + theme_bw()
text1 <- paste("Text regarding age goes here break up sentences to make pretty")
text.a <- ggparagraph(text = text1, face = "italic", size = 11, color = "black")
text2 <- paste("Text regarding BMI here break up to make pretty")
text.b <- ggparagraph(text = text2, face = "italic", size = 11, color = "black")
# Set nrow so all five panels fit on one page instead of a multi-page list.
ggarrange(ggp1, text.a, ggp2, text.b, ggp3, ncol = 2, nrow = 3, align = "v", common.legend = TRUE)
Jessica: The above is using data from Star Wars.
Perform bivariate analyses on the quantitative variables (direction, strength and form). Describe the linear association between all variables.
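One way to read direction and strength is a pairwise correlation matrix. The sketch below shows the call on a tiny data frame built only from the first five rows visible in the `str(Insurance)` output earlier in this report. The name `first_rows` is made up for this illustration, and because the full 50-row data set is not reproduced here, the numbers it prints are not the report's results.

```r
# First five rows only, copied from the str(Insurance) output earlier in
# this report. These are NOT the full data, so the correlations printed
# here only illustrate the call and are not the answer to the question.
first_rows <- data.frame(
  Charges  = c(9145, 7441, 12143, 3260, 19023),
  Age      = c(52, 45, 60, 31, 39),
  BMI      = c(36.7, 30.2, 25.7, 20.4, 18.3),
  Children = c(0, 1, 0, 0, 5)
)

# Pairwise Pearson correlations: the sign gives the direction of each
# linear association and the magnitude gives its strength.
cor_matrix <- cor(first_rows)
print(round(cor_matrix, 2))
```

On the full data, the same idea would be `cor(Insurance)` once the categorical columns have been dropped, and `pairs(Insurance)` gives a scatterplot matrix for judging form (linear or curved) by eye.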
Generate a regression equation in the following form:
\[Charges = \beta_{0}+\beta_{1}*Age+\beta_{2}*BMI+\beta_{3}*Children\]
model <- lm(Charges ~ Age + BMI + Children, data = Insurance)
summary(model)
##
## Call:
## lm(formula = Charges ~ Age + BMI + Children, data = Insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7795 -5107 -3978 -2406 44936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5962.0 8845.5 -0.674 0.50368
## Age 346.5 118.5 2.925 0.00533 **
## BMI 133.0 277.5 0.479 0.63388
## Children -107.3 1308.1 -0.082 0.93497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10620 on 46 degrees of freedom
## Multiple R-squared: 0.1737, Adjusted R-squared: 0.1198
## F-statistic: 3.223 on 3 and 46 DF, p-value: 0.03109
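The fitted regression equation is:
\[Charges = -5962.0+346.5*Age+133.0*BMI-107.3*Children\]
Only Age is statistically significant at the 5% level (p = 0.00533). The model as a whole is significant (F = 3.223, p = 0.03109) but explains only about 17% of the variation in charges (R-squared = 0.1737).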
An eager insurance representative comes back with a potential client. The client is 40 years old, their BMI is 30, and they have one dependent. Using the regression equation above, predict the amount of medical expenses associated with this policy. (Provide a 95% confidence interval as well.)
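newPrediction <- data.frame(Age = 40, BMI = 30, Children = 1)
# interval = "confidence" bounds the mean charge for clients with this profile;
# interval = "prediction" would instead bound a single client's charge.
predict(model, newdata = newPrediction, interval = "confidence", level = 0.95)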
## fit lwr upr
## 1 11782.35 8598.572 14966.13
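The point estimate can be checked by hand by plugging the client's values into the fitted equation. The sketch below uses the coefficients as they are printed (rounded) in the `summary(model)` output. The vector names `coefs`, `client`, and `manual_fit` are made up for this illustration.

```r
# Coefficients as printed (rounded) in the summary(model) output above.
coefs  <- c(Intercept = -5962.0, Age = 346.5, BMI = 133.0, Children = -107.3)
# Client profile: 1 for the intercept, then Age = 40, BMI = 30, Children = 1.
client <- c(Intercept = 1, Age = 40, BMI = 30, Children = 1)

# Point estimate = sum over terms of coefficient * value:
# -5962.0 + 346.5 * 40 + 133.0 * 30 - 107.3 * 1
manual_fit <- sum(coefs * client)
print(manual_fit)  # 11780.7
```

The hand calculation gives 11780.7, which agrees with the `fit` value of 11782.35 from `predict()` up to the rounding of the printed coefficients. In words, the model predicts annual charges of about $11,782 for this client, with a 95% confidence interval of roughly $8,599 to $14,966.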